On Bias-free Crawling and Representative Web Corpora
Author
Abstract
In this paper, I present a specialized open-source crawler that can be used to obtain bias-reduced samples from the web. First, I briefly discuss the relevance of bias-reduced web corpus sampling for corpus linguistics. Then, I summarize theoretical results that show how commonly used crawling methods obtain highly biased samples from the web. The theoretical part of the paper is followed by a description of my feature-complete and stable ClaraX crawler, which performs so-called Random Walks, a form of crawling that allows for bias-reduced sampling if combined with methods of post-crawl rejection sampling. Finally, results from two large crawling experiments in the German web are reported. I show that bias reduction is feasible if certain technical and practical hurdles are overcome.

1 Corpus Linguistics, Web Corpora, and Biased Crawling

Very large web corpora are necessarily derived from crawled data. Such corpora include COW (Schäfer and Bildhauer, 2012), LCC (Goldhahn et al., 2012), UMBC WebBase (Han et al., 2013), and WaCky (Baroni et al., 2009). A crawler (Manning et al., 2009; Olston and Najork, 2010) recursively locates unknown documents by following URL links from known documents, which means that a set of start URLs (the seeds) has to be known before the crawl. Crawling strategies differ primarily in how they queue (i.e., prioritize) the harvested links for download. A typical real-world goal is to optimize the queueing algorithm so that as many good corpus documents as possible are found in the shortest possible time, in order to save on bandwidth and processing costs (Suchomel and Pomikálek, 2012; Schäfer et al., 2014). Such an efficiency-oriented approach is reasonable if corpus size matters most.

However, the goals of corpus construction might be different for many corpora intended for use in corpus linguistics. Especially in traditional corpus linguistics, where forms of balanced or even representative corpus design (Biber, 1993) are sometimes advocated as the only viable option, web corpora are often regarded with reservation, partly because the sources from which they are compiled and their exact composition are unknown (Leech, 2007). Other corpus linguists are more open to web data. For example, in branches of cognitively oriented corpus linguistics that adopt the corpus-as-input hypothesis, e.g., Stefanowitsch and Flach (2016, in press), nothing speaks against using large web corpora. Under such a view, corpora are seen as reflecting an average or typical input of a language user. Consequently, the larger and thus more varied a corpus is, the better potential individual differences between speaker inputs are averaged out.

Even under such a more open perspective, corpus designers should make sure that the material used for a web corpus is not heavily biased. Naive crawling can lead to very obvious biases. For example, Schäfer and Bildhauer (2012, 487) report that in two large-scale crawls of the .se top-level domain, the Heritrix crawler (Mohr et al., 2004) downloaded 75% of the text mass that ended up in the final corpus from a single blog host. The final corpus still comprised 1.5 billion tokens; seemingly large size thus does not prevent heavy crawling bias in web corpora, as the Swedish web most certainly does not consist of 75% blogs. Apart from such immediately visible problems (which, admittedly, can be solved by relatively simple countermeasures), there are structural and
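To make the queueing step mentioned above concrete, the following is a minimal sketch, not taken from the paper or from any of the crawlers cited in it, of a prioritized URL frontier of the kind efficiency-oriented crawlers rely on. The priority score passed to push() is assumed to come from some external yield heuristic and is purely illustrative.

```python
# Minimal sketch of a prioritized URL frontier, illustrating the queueing step
# that efficiency-oriented crawling strategies optimize. Illustrative only; the
# priority score is assumed to be supplied by some external yield heuristic.

import heapq

class PriorityFrontier:
    """URL frontier that always hands out the highest-scoring queued URL."""

    def __init__(self):
        self._heap = []       # min-heap of (-score, insertion_order, url)
        self._seen = set()    # URLs already queued, to avoid duplicates
        self._order = 0       # tie-breaker: equal scores are served FIFO

    def push(self, url, score):
        """Queue a harvested URL with a priority score (higher = fetched sooner)."""
        if url in self._seen:
            return
        self._seen.add(url)
        heapq.heappush(self._heap, (-score, self._order, url))
        self._order += 1

    def pop(self):
        """Return the URL that should be downloaded next."""
        _, _, url = heapq.heappop(self._heap)
        return url
```

With this tie-breaking scheme, giving every link the same score reduces the frontier to plain breadth-first (FIFO) order, whereas yield-oriented crawlers such as SpiderLing (see the related work below) score links from text-rich hosts higher.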
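The bias-reduction strategy summarized in the abstract can be illustrated in the same vein: a Random Walk tends to visit well-linked pages more often, so a post-crawl rejection-sampling step that accepts each visited page with a probability inversely proportional to an estimate of how likely the walk was to visit it evens the sample out again. The sketch below shows only this general idea and is not the ClaraX implementation; fetch_out_links() is a hypothetical helper standing in for page download and link extraction, and the out-degree is used as a crude stand-in for the true visiting probability.

```python
# Minimal sketch of a Random Walk with post-crawl rejection sampling, illustrating
# the general idea behind bias-reduced web sampling. NOT the ClaraX implementation.

import random

def fetch_out_links(url):
    """Hypothetical helper: download `url` and return its outgoing link URLs."""
    raise NotImplementedError

def random_walk(seed_url, steps):
    """Follow one randomly chosen out-link per step, recording (url, out_degree).

    The walk visits well-linked pages more often; that is exactly the bias the
    rejection-sampling step below is meant to reduce.
    """
    visited = []
    url = seed_url
    for _ in range(steps):
        links = fetch_out_links(url)
        if not links:          # dead end: restart the walk from the seed
            url = seed_url
            continue
        visited.append((url, len(links)))
        url = random.choice(links)
    return visited

def rejection_sample(visited, d_min=1):
    """Accept each visited page with probability d_min / degree.

    Pages the walk over-samples (high degree) are down-sampled accordingly,
    yielding an approximately bias-reduced selection of documents.
    """
    return [url for url, degree in visited
            if random.random() < d_min / max(degree, d_min)]
```

How visiting probabilities are estimated in practice, and which technical and practical hurdles arise in real crawls, is what the experiments reported in the paper address.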
Similar Resources
Efficient Web Crawling for Large Text Corpora
Many researchers use texts from the web, an easy source of linguistic data in a great variety of languages. Building both large and good-quality text corpora is the challenge we face nowadays. In this paper we describe how to deal with inefficient data downloading and how to focus crawling on text-rich web domains. The idea has been successfully implemented in SpiderLing. We present efficiency ...
Prioritize the ordering of URL queue in Focused crawler
The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler it is not a simple task to download domain-specific web pages, and an unfocused approach often yields undesired results. Therefore, several new ideas have been proposed; among them, a key technique is focused crawling, which is able to crawl particular topical...
Building general- and special-purpose corpora by Web crawling
The Web is a potentially unlimited source of linguistic data; however, commercial search engines are not the best way for linguists to gather data from it. In this paper, we present a procedure to build language corpora by crawling and postprocessing Web data. We describe the construction of a very large Italian general-purpose Web corpus (almost 2 billion words) and a specialized Japanese “blo...
Comparing the Quality of Focused Crawlers and of the Translation Resources Obtained from them
Comparable corpora have been used as an alternative to parallel corpora as resources for computational tasks that involve domain-specific natural language processing. One way to gather documents related to a specific topic of interest is to traverse a portion of the web graph in a targeted way, using focused crawling algorithms. In this paper, we compare several focused crawling algorithms usin...
Practical Web Crawling for Text Corpora
SpiderLing, a web spider for linguistics, is new software for creating text corpora from the web, which we present in this article. Many documents on the web contain only material which is not useful for text corpora, such as lists of links, lists of products, and other kinds of text not comprised of full sentences. In fact, such pages represent the vast majority of the web. Therefore, by doing unr...
Journal:
Volume, Issue:
Pages: -
Publication date: 2016